This is part of the documentation for UWOT.
December 29 2018: New, better settings for t-SNE, better plots, and a couple of new datasets. Removed neighborhood preservation values until I’ve double-checked they are working correctly.
Here are some examples of the output of uwot’s implementation of UMAP, compared to t-SNE output. As you will see, UMAP tends to produce more compact, more widely separated clusters than t-SNE.
For details on the datasets, follow their links. Somewhat more detail is also given in the smallvis documentation. iris you already have if you are using R. s1k is part of the sneer package. frey, oli, mnist, fashion, kuzushiji, norb and cifar10 can be downloaded via snedata. coil20 and coil100 can be fetched via coil20.
mnist <- snedata::download_mnist()
# For some functions we need to strip out non-numeric columns and convert data to matrix
x2m <- function(X) {
  if (!methods::is(X, "matrix")) {
    m <- as.matrix(X[, which(vapply(X, is.numeric, logical(1)))])
  } else {
    m <- X
  }
  m
}
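As a quick check of what the helper does, here is a small usage sketch on iris, which ships with R. The function body is repeated only so the example runs on its own; the variable name iris_m is just for illustration.

```r
# Same helper as above, repeated so this example is self-contained
x2m <- function(X) {
  if (!methods::is(X, "matrix")) {
    m <- as.matrix(X[, which(vapply(X, is.numeric, logical(1)))])
  } else {
    m <- X
  }
  m
}

# iris has four numeric columns and one factor column (Species)
iris_m <- x2m(iris)
dim(iris_m)      # 150 rows, 4 columns
colnames(iris_m) # Species has been dropped
```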
At the time I generated this document (late December 2018), the kuzushiji dataset had some duplicate and all-black images that needed filtering. This seems to have been remedied as of early February 2019. I re-ran both UMAP and t-SNE on the fixed dataset, but the results weren’t noticeably different. For the record, the clean-up routines I ran were:
# Remove all-black images in Kuzushiji MNIST (https://github.com/rois-codh/kmnist/issues/1)
kuzushiji <- kuzushiji[-which(apply(x2m(kuzushiji), 1, sum) == 0), ]
# Remove duplicate images (https://github.com/rois-codh/kmnist/issues/5)
kuzushiji <- kuzushiji[-which(duplicated(x2m(kuzushiji))), ]
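If you adapt these routines to another dataset, be aware that the -which(...) idiom only behaves as intended when at least one row matches. The following is a minimal sketch on a made-up toy data frame (not the Kuzushiji data, and not what I originally ran) showing an equivalent filter based on logical indexing, assuming non-negative pixel values so that a zero row sum means an all-black image.

```r
# Toy data frame: row 2 is all zeros, row 4 duplicates row 1
df <- data.frame(a = c(1, 0, 2, 1), b = c(3, 0, 4, 3),
                 label = c("x", "y", "z", "x"))
num <- as.matrix(df[, c("a", "b")])

# Logical indexing keeps rows that are neither all-zero nor duplicated
keep <- rowSums(num) != 0 & !duplicated(num)
cleaned <- df[keep, ]

# The pitfall with -which(): when nothing matches, which() returns integer(0),
# and indexing with -integer(0) selects zero rows instead of all of them
no_match <- which(rowSums(num) < 0)
nrow(df[-no_match, ])  # 0 rows, not 4
```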
For UMAP, I stick with the defaults, with the exception of iris, coil20, coil100 and norb. With these, the spectral initialization with the default n_neighbors leads to disconnected components, which can lead to a poor global picture of the data. The Python UMAP implementation goes to fairly involved lengths to ameliorate these issues, but uwot does not.
For these datasets, a perfectly good alternative that provides a global initialization is to use the first two components from PCA, scaled so their standard deviations are initially 1e-4 (via init = "spca"). This usually results in an embedding which isn’t too different from starting via the raw PCA but is more compact, i.e. less space between clusters. For visualizations, as long as the relative orientation and rough distances between clusters are maintained, the exact distances between them are not that interesting to me.
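To illustrate the idea of the scaled PCA initialization, here is a rough base R sketch. It is not uwot’s internal code: the function name scaled_pca_init and its arguments are hypothetical, and rescaling each column separately is my assumption about what "scaled" means here.

```r
# Sketch of a scaled PCA initialization (hypothetical helper, not uwot's code)
scaled_pca_init <- function(X, ndim = 2, sdev = 1e-4) {
  # Centered PCA scores for the first `ndim` components
  scores <- stats::prcomp(X, center = TRUE, scale. = FALSE)$x
  scores <- scores[, seq_len(ndim), drop = FALSE]
  # Rescale each column so that its standard deviation equals `sdev`
  sweep(scores, 2, apply(scores, 2, stats::sd), "/") * sdev
}

init <- scaled_pca_init(as.matrix(iris[, 1:4]))
apply(init, 2, sd)  # both columns have standard deviation 1e-4
```

The relative layout of the points is the same as in the raw PCA scores along each axis; only the overall scale changes, which is why the optimized embedding tends to end up more compact.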
Dimensionality was reduced to 100 by applying PCA to the data. PCA is commonly applied to data before or as part of t-SNE, so to avoid any differences arising from this pre-processing, I applied it to the UMAP input as well. I used 100 components rather than the usual 50, just to be on the safe side. This seemed to have no major effect on the resulting visualizations.
# For iris
iris_umap <- umap(iris, init = "spca")
# Small datasets (s1k, oli, frey)
s1k_umap <- umap(s1k)
# Big datasets (mnist, fashion, kuzushiji)
mnist_umap <- umap(mnist, pca = 100)
# norb, coil20 and coil100
coil20_umap <- umap(coil20, pca = 100, init = "spca")
The Rtsne package was used for the t-SNE calculations, except for the iris dataset, which proved troublesome once again. This time it’s because Rtsne doesn’t allow for duplicates. For iris only, I used the smallvis package.
For t-SNE, I also employ the following non-defaults:
- perplexity = 15, which is closer to the neighborhood size used by UMAP.
- initial_dims = 100, which applies PCA to the input, keeping the first 100 components rather than the usual 50, to minimize distortion from the initial PCA.
- partial_pca = TRUE: as uwot also makes use of irlba, there is no reason not to use partial_pca.
It would be nice to use the same initial coordinates for both methods, but unfortunately Rtsne doesn’t apply early exaggeration with user-supplied input. Without early exaggeration, t-SNE results aren’t as good, especially with larger datasets. Therefore the t-SNE plots use a random initialization.
# For iris only
iris_tsne <- smallvis::smallvis(iris, perplexity = 15, Y_init = "rand", exaggeration_factor = 4)
# Small datasets (s1k, oli, frey)
s1k_tsne <- Rtsne::Rtsne(x2m(s1k), perplexity = 15, initial_dims = 100,
                         partial_pca = TRUE, exaggeration_factor = 4)
# Big datasets (coil20, coil100, mnist, fashion, kuzushiji)
mnist_tsne <- Rtsne::Rtsne(x2m(mnist), perplexity = 15, initial_dims = 100,
                           partial_pca = TRUE, exaggeration_factor = 12)
For visualization, I used the vizier package. The plots are colored by class membership (there’s an obvious choice for every dataset considered), except for frey, where the points are colored according to their position in the sequence of images.
embed_img <- function(X, Y, k = 15, ...) {
  args <- list(...)
  args$coords <- Y
  args$x <- X
  do.call(vizier::embed_plot, args)
}
embed_img(iris, iris_umap, pc_axes = TRUE, equal_axes = TRUE, alpha_scale = 0.5, title = "iris UMAP", cex = 1)
For UMAP, where non-default initialization was used, it’s noted in the title of the plot (e.g. “(spca)”).
The standard iris dataset, known and loved by all.
A 9-dimensional fuzzy simplex, which I created for testing t-SNE and related methods, originally in the sneer package.
Images of Brendan Frey’s face, as far as I know originating from a page belonging to Sam Roweis.
Yet more faces, this time the dataset used in Isomap, consisting of images of the same face under different rotations and lighting conditions. Unfortunately, it’s no longer available at the MIT website, but it can be found via the Wayback Machine. I wrote a gist for processing the data in R.
In the images below the points are colored by the first pose angle.
The Swiss Roll data used in Isomap. A famous dataset, but perhaps not that representative of typical real-world datasets. t-SNE is known to not handle this well, but UMAP makes an impressive go at unfolding it.
The COIL-100 Columbia Object Image Library.
The UMAP results are rather hard to visualize on a static plot. If you could pan and zoom around, you would see that the rather indistinct blobs are mainly correctly preserved loops. This is an example where choosing a different value of the a and b parameters would be a good idea. The t-UMAP variant, available as tumap, uses a = 1, b = 1 (effectively the same Cauchy kernel as t-SNE). It does a better job here and is shown below on the left. The correct choice of parameters is also important for t-SNE. On the right is the t-SNE result with a default perplexity = 50, which does not retain as much of the loop structure as the lower perplexity:
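To make the role of a and b more concrete, here is a small base R sketch of the curve UMAP uses for low-dimensional similarities, w(d) = 1 / (1 + a d^(2b)). The function names umap_weight and cauchy_weight are mine, for illustration only; the sketch just confirms that a = 1, b = 1 reproduces the (unnormalized) Cauchy kernel used by t-SNE.

```r
# UMAP-style low-dimensional similarity curve; a and b control its shape
umap_weight <- function(d, a = 1, b = 1) {
  1 / (1 + a * d^(2 * b))
}

# Unnormalized Cauchy kernel, as used for output similarities in t-SNE
cauchy_weight <- function(d) {
  1 / (1 + d^2)
}

d <- seq(0, 3, by = 0.5)
all.equal(umap_weight(d, a = 1, b = 1), cauchy_weight(d))  # TRUE

# Larger values such as a = 2, b = 2 fall off more steeply at larger distances
round(umap_weight(d, a = 2, b = 2), 3)
```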
The small NORB dataset, pairs of images of 50 toys photographed at different angles and under different lighting conditions.
The CIFAR-10 dataset, consisting of 60000 32 x 32 color images evenly divided across 10 classes (e.g. airplane, cat, truck, bird). t-SNE was applied to CIFAR-10 in the Barnes-Hut t-SNE paper, but in the main paper, only the results after passing through a convolutional neural network were published. t-SNE on the original pixel data was only given in the supplementary information (PDF) which is oddly hard to find a link to via JMLR or the article itself.
Yikes. What is going on with the UMAP result? I see the same thing in the Python UMAP implementation (although maybe less pronounced), so I don’t think this is a bug in uwot. There is an outlying cluster of automobile images in the bottom left, which seem to be variations on the same image. The same cluster is present in the t-SNE plot (also in the bottom left), but is comfortably close to the rest of the data.
The existence of these near-duplicates in CIFAR-10 doesn’t seem to have been widely known or appreciated until quite recently, see for instance this twitter thread and this paper by Recht and co-workers. Such manipulations are in line with the practice of data augmentation that is popular in deep learning, but you would need to be aware of it to avoid the test set results being contaminated. These images seem like a good argument for applying UMAP or t-SNE to your dataset as a way to spot this sort of thing.
We can get better results with UMAP by using the scaled PCA initialization (below on the left) and by using the t-UMAP settings (below, right):
That rogue cluster is still present (off to the lower right now), but either image is more comparable to the t-SNE result.
The tasic2018 dataset is a transcriptomics dataset of mouse brain cell RNA-seq data from the Allen Brain Atlas (originally reported by Tasic and co-workers). There is gene expression data for 14,249 cells from the primary visual cortex, and 9,573 cells from the anterior lateral motor cortex, to give a dataset of size n = 23,822 overall. Expression data for 45,768 genes were obtained in the original data, but the dataset used here follows the pre-processing treatment of Kobak and Berens, which applied a normalization and log transformation and then kept only the top 3000 most variable genes.
The data can be generated from the Allen Brain Atlas website and processed in Python by following the instructions in this Berens lab notebook. I output the data to CSV format for reading into R and assembling into a data frame with the following extra exporting code:
np.savetxt("path/to/allen-visp-alm/tasic2018-log3k.csv", logCPM, delimiter=",")
np.savetxt("path/to/allen-visp-alm/tasic2018-areas.csv", tasic2018.areas, delimiter=",", fmt='%d')
np.savetxt("path/to/allen-visp-alm/tasic2018-genes.csv", tasic2018.genes[selectedGenes], delimiter=",", fmt='%s')
np.savetxt("path/to/allen-visp-alm/tasic2018-clusters.csv", tasic2018.clusters, delimiter=",", fmt='%d')
np.savetxt("path/to/allen-visp-alm/tasic2018-cluster-names.csv", tasic2018.clusterNames, delimiter=",", fmt='%s')
np.savetxt("path/to/allen-visp-alm/tasic2018-cluster-colors.csv", tasic2018.clusterColors, delimiter=",", fmt='%s')
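On the R side, the exported files can be read back and assembled along the following lines. This is only a sketch, not the code I used: the function name read_tasic2018 and the column names area and cluster are hypothetical, and it assumes the expression matrix was written with one row per cell and one column per selected gene.

```r
# Hypothetical helper: assemble the exported CSV files into a data frame.
# File names match the np.savetxt calls above; `dir` is the export folder.
read_tasic2018 <- function(dir) {
  f <- function(name) file.path(dir, paste0("tasic2018-", name, ".csv"))
  # Expression matrix (assumed: rows are cells, columns are genes)
  expr <- as.matrix(utils::read.csv(f("log3k"), header = FALSE))
  colnames(expr) <- readLines(f("genes"))
  df <- data.frame(expr, check.names = FALSE)
  # Per-cell integer annotations, one value per line
  df$area <- scan(f("areas"), what = integer(), quiet = TRUE)
  df$cluster <- scan(f("clusters"), what = integer(), quiet = TRUE)
  df
}
```

The cluster names and colors files can be read with readLines in the same way and used to label or color the plots.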
Again, the default UMAP settings produce clusters that are a bit too well-separated to see clearly in these images. So here are the results from using t-UMAP (the same as setting a = 1, b = 1), on the left, and then using a = 2, b = 2 on the right: